Abstract

We consider communication when there is no agreement about symbols and mean- 
ings. We treat it within the framework of reinforcement learning. We apply dif- 
ferent reinforcement learning models in our studies and simplify the problem as 
much as possible. We show that the modelling of the other agent is insufficient in 
the simplest possible case, unless the intentions can also be modelled. The model 
of the agent and its intentions enable quick agreements about symbol-meaning 
association. We show that when both agents assume an 'intention model' about 
the other agent then the symbol-meaning association process can be spoiled and 
symbol meaning association may become hard. 

1 Introduction 

The emergence of communication is one of the most enigmatic problems for
several disciplines, including evolution, natural language theory, and information
technology. For a recent collection of papers, see, e.g., (jjj). For a proper
treatment, the concept of communication needs to be considered. First, let
us see a few examples.

Smoke signals. These are 'few bit' signals that could mean attention, 
danger, help, and so on. The vocabulary is small, the communication
speed is high, the communication distance is large. The primary goal of 
this communication is to overcome limited observation capabilities of other 
agents, to warn, and to coordinate future actions. 

Atomic interactions. Not all light-enabled interaction is, however, communication:
atoms, for example, interact with each other by exchanging photons.
The emission and the absorption of photons are not intentional and the 
transmitted photon has no hidden meaning. 

Grooming. According to Dunbar, grooming between monkeys is used, 
for example, to form alliances, serve, or apologize (Q). Thus, we consider 
grooming communication, although it is non-verbal communication. 

Then, the common features of communication are as follows: (i) com- 
munication is optional, (ii) it is intentional, and (iii) communicated signals 
are symbols of certain meanings. Further, (iv) communication is successful, 
if the meaning is the same for those who communicate. The emergence of 
communication is the subject of evolutionary linguistics (for a recent review 
on evolutionary linguistics, see Ijl2h). Evolutionary linguistics focuses on the 
selective scenario that might give rise to the appearance of early languages. 
There are many theories and many possibilities. Let us consider the popu- 
lar and efficient language game approach (13; BlJ @; 0). In language games, 
the theoretical approach makes certain assumptions. Presupposed conditions 
include the following: agents interact and their interaction is 'coordinated'. 
Thus, the language game approach presupposes an agreement that the agents
engage in 'coordinated actions'. Such an agreement is also
a symbol-meaning association. Thus, the language game approach assumes 
existing symbol-meaning association and builds on that assumption. 

Our question concerns the very minimum of symbol-meaning association 
needed for successful communication. To this end, we make the problem as 
simple as possible. Our analysis is embedded into the framework of reinforce- 
ment learning. We study how communication may depend on the presence 
or the absence of the communication of emotions or internal values. In our 
simulations, communication will emerge as a deliberate action of the agents, 
but only if certain conditions are fulfilled. 

The paper is organized as follows. We provide the theoretical analysis in
Section 2. This analysis shows the necessity of emotional coupling between
agents. We illustrate the analysis with simulations in simple scenarios (Section 3).
We discuss our results in Section 4 and conclude in Section 5.
The paper is understandable without involved mathematical tools. Mathematical
details are presented in the Appendices for the sake of completeness.



2 Theoretical analysis 

In this section, we investigate conditions under which communication between two
autonomous agents can emerge. The question is how two autonomous agents
could learn to communicate. We assume that neither the meanings nor the
communication signals are fixed in advance, there is no special method of
negotiation, and there is no will for communication. However, the possibility
of communication is given, and the world is such that communication could
be advantageous.

We investigate the problem in the framework of reinforcement learning
(for an excellent introduction, see (11 In)). We investigate how communication
may emerge from joint problem solving. That is, we ask how agents could
learn when and what to communicate based on utility, and how they could learn
to emit and interpret signals provided that both parties benefit from those.
Surprisingly, if communication has a cost, then it is still not sufficient
that

• the possibility of communication is given and 

• communication would be beneficial for both agents (even with costs). 

The underlying reason is that we have assumed that neither agent
has a fixed interpretation of the signals; therefore, they have to learn
simultaneously the translation from meanings to signals and vice versa. Let us call
the agent that wishes to communicate something the 'speaker', and
the other one, which should learn to interpret it, the 'listener'. Now consider
the case when both agents are in the learning phase, and the speaker
experiments with different signals to express different meanings. The listener may
not be able to differentiate the meanings and, because of the costs, stops
listening (i.e., learns that it is not worth communicating). This effect
appears already in the simplest possible case. In this case, behaviors can be
computed analytically.
Consider two agents, A and B. For the sake of simplicity, we assume that
communication is one-directional: A may speak and B may listen to it. In
each episode, agent A may be in either state "1" or state "2" (with equal
probability), and has three possible actions: communicate "X", communicate "Y",
and do not communicate. Communication has a cost of $1 > c_A > 0$. Agent
B may listen to the signal of A for a cost of $1 > c_B > 0$, and has to guess the
state of A (say "1" or "2"). They both receive a reward of $+1$ if the guess
is correct and a penalty of $-1$ if not. Since the cost of communication is
less than the reward obtainable by it, communication is desirable if the two
agents are able to agree that saying "X" means one of the states and saying
"Y" means the other.

The policy of A can be described by the triple $M_A = (\alpha, p_1, p_2)$, where $\alpha$
is the probability that A will communicate something, $p_1$ is the probability
that A says "X" in state "1" (so $1 - p_1$ is the probability that A says "Y" in state
"1") given that he decides to communicate, and $p_2$ is the probability that A
says "X" in state "2" (so $1 - p_2$ is the probability that A says "Y" in state "2")
given that he is communicating. Similarly, the policy of B can be described
by the triple $M_B = (\beta, q_X, q_Y)$, where $\beta$ is the probability that B will listen
to the signal, $q_X$ is the probability that B guesses "1" after hearing "X" (so
$1 - q_X$ is the probability that B guesses "2" after hearing "X") given that he
listens, and $q_Y$ is the probability that B guesses "1" after hearing "Y" (so $1 - q_Y$
is the probability that B guesses "2" after hearing "Y") given that he listens.
The probabilities and rewards for the case when A talks and B listens are
summarized in Figure 1. If B does not listen, or A does not talk, then B
guesses "1" with probability 0.5.
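One episode of this game can be sketched in a few lines of Python. This is only an illustrative simulation under the rules stated above; the function name `play_episode` and the convention that each agent's reward is the common outcome minus its own cost are our assumptions, not code from the original study.

```python
import random


def play_episode(m_a, m_b, c_a, c_b, rng):
    """Play one episode and return (reward_A, reward_B).

    m_a = (alpha, p1, p2) and m_b = (beta, q_x, q_y) as defined in the text.
    """
    alpha, p1, p2 = m_a
    beta, q_x, q_y = m_b
    state = rng.choice((1, 2))  # both states are equally likely

    # Agent A: decide whether to talk and, if so, which signal to send.
    signal = None
    if rng.random() < alpha:
        p_say_x = p1 if state == 1 else p2
        signal = "X" if rng.random() < p_say_x else "Y"

    # Agent B: decide whether to listen, then guess the state of A.
    listens = rng.random() < beta
    if listens and signal is not None:
        p_guess_1 = q_x if signal == "X" else q_y
    else:
        p_guess_1 = 0.5  # no usable signal: uninformed guess
    guess = 1 if rng.random() < p_guess_1 else 2

    # Common outcome (+1 or -1); each agent pays only its own cost.
    outcome = 1.0 if guess == state else -1.0
    reward_a = outcome - (c_a if signal is not None else 0.0)
    reward_b = outcome - (c_b if listens else 0.0)
    return reward_a, reward_b
```

With a perfect agreement such as `m_a = (1.0, 1.0, 0.0)` and `m_b = (1.0, 1.0, 0.0)`, every episode yields the outcome +1 minus the respective cost for both agents.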

It is easy to calculate that if both of them communicate, the common
part of their expected reward is $(p_1 - p_2)(q_X - q_Y)$, and it is 0 if either of them is
not communicating. Thus, the expected reward for policies $M_A$ and $M_B$ is

$$R_A(M_A, M_B) = \alpha \cdot (-c_A) + 2\alpha\beta (p_1 - p_2)(q_X - q_Y)$$

for agent A and

$$R_B(M_A, M_B) = \beta \cdot (-c_B) + 2\alpha\beta (p_1 - p_2)(q_X - q_Y)$$

for agent B. We have assumed that neither A nor B can bind meanings to
signals, so initially $p_1 \approx p_2$ and $q_X \approx q_Y$. Let us investigate the learning
process of agent A. If $|q_X - q_Y| < \epsilon$ (B cannot distinguish well between
meanings), the cost term of A will be greater than his reward term, so (i)
he cannot tune $p_1$ and $p_2$ reliably (their gradient is small), and (ii) he can
minimize his losses by lowering $\alpha$. The exact value of $\epsilon$ depends on the cost
of communication. Similarly, as long as A has not learned to distinguish between
concepts, B will try to minimize $\beta$ and cannot reliably tune $q_X$ and $q_Y$.

Fig. 1: Various outcomes and associated rewards
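The argument can be checked numerically. The sketch below enumerates all state, signal and guess combinations to recover the common part $(p_1 - p_2)(q_X - q_Y)$, and evaluates the derivative of $R_A$ with respect to $\alpha$ as it follows from the formula for $R_A$ in the text. The function names are illustrative assumptions.

```python
def common_part(p1, p2, q_x, q_y):
    """Expected +1/-1 outcome when A talks and B listens, by enumeration."""
    total = 0.0
    for state, p_say_x in ((1, p1), (2, p2)):
        # Each branch: (probability of the signal, P(B guesses "1" | signal)).
        for p_signal, p_guess_1 in ((p_say_x, q_x), (1.0 - p_say_x, q_y)):
            p_correct = p_guess_1 if state == 1 else 1.0 - p_guess_1
            # The state has probability 0.5; the outcome is +1 or -1.
            total += 0.5 * p_signal * (2.0 * p_correct - 1.0)
    return total


def d_reward_a_d_alpha(m_a, m_b, c_a):
    """Derivative of R_A with respect to alpha, from the formula in the text."""
    _, p1, p2 = m_a
    beta, q_x, q_y = m_b
    return -c_a + 2.0 * beta * (p1 - p2) * (q_x - q_y)
```

For nearly symmetric initial parameters, for example $p_1 = q_X = 0.55$ and $p_2 = q_Y = 0.45$ with $\beta = 0.75$ and $c_A = 0.1$, the derivative is negative, so a gradient step lowers $\alpha$.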

As a result, during early trials, $p_1$, $p_2$, $q_X$ and $q_Y$ can only change
stochastically, by random walk. As the cost of communication grows, so does $\epsilon$, and
the time needed to exceed this limit by random walk grows exponentially.
However, during this time, $\alpha$ and $\beta$ keep diminishing. So by the time A and
B could (by chance) break the symmetry and learn the distinction of meanings,
they will have learned that communication is not useful. We note that in the
general case, knowing the other agent's dynamics (the parameter sets $(p_1, p_2, \alpha)$
and $(q_X, q_Y, \beta)$) does not always help; e.g., if the reward of one agent
is not available to the other agent and vice versa, or if the rewards of the
agents do not depend on each other's behaviors. In our two-state example
behaviors are coupled. Then, in theory, agents could use certain methods
to estimate the hidden reward function of the other agent. For example,
indirect implicit estimation is accomplished by the general policy gradient
method: this method - up to some extent - overcomes partial observations.
This is so because individual trajectories are considered in this case. Successful
estimation is, however, highly improbable in sophisticated real-life situations.

Main hypothesis 

Within the framework of reinforcement learning, we have a single means to
cure the flaw described previously: the agents should be able to model each
other's intentions, that is, the dynamics and the 'goals' of the other agent. This is
possible if the values $R_A$ and/or $R_B$ are made available to them.

Now, the situation becomes different: agent A can optimize $M_A$ for a
fixed $M_B$. Although agent A cannot modify the policy of B, he can model
what would be rewarding for agent B. Furthermore, he considers the optimal
combination of the $M_A$ and $M_B$ strategies. Let us see the possible scenarios:

1-step modelling

Optimizing $M_A$ for a fixed $M_B$ means calculating the conditional strategy

$$M_{A|B}(M_B) = \arg\max_{M_A} R_A(M_A, M_B),$$

that is, A can calculate what the optimal choice for himself would be
if B followed $M_B$.

2-step modelling

If agent A 'knows' that he is using the conditional 1-step modelling strategy
about agent B, then he might as well suppose that B does the same, i.e.,
agent A might suppose that the strategy of agent B is the following:

$$M_{B|A}(M_A) = \arg\max_{M_B} R_B(M_A, M_B).$$

Now, agent A can simply choose his optimal strategy:

$$M_A^* = \arg\max_{M_A} R_A(M_A, M_{B|A}(M_A)).$$

It might be worth noting that this abstract problem phrasing goes beyond 
the problem of communication; it is a general learning problem. If an agent 
does something and it is visible to the other agent then it is a signal, which 
is dependent on the state of the first agent. If both agents are learning, then
the situation becomes similar to our simplified example on communication. 

Now, we introduce the basic concepts of reinforcement learning. Then we 
turn to the illustrative numerical experiments. 

2.1 Basic concepts of reinforcement learning 

Reinforcement learning aims to solve behavior optimization based on immediate
rewards. The main goal of optimization is to maximize the long-term
discounted and cumulated reward, called the value or return, during the decision
making process. Reinforcement learning problems may be solved by value
function estimation or by direct strategy (policy) search methods. In value
function estimation, states or state-action pairs are assigned value estimates
that reflect the expected value of the long-term cumulated and possibly discounted
reward of choosing them. The agent is not greedy with respect to the immediate
reward, so it may not choose the action with the best immediate reward. Instead,
it acts greedily with respect to this value function: it selects the next state or
action that promises the optimal long-term (discounted and) cumulated reward,
also called the return.
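As a small illustration of the return, the following sketch computes the discounted cumulated reward of a reward sequence. The function name and the discount symbol `gamma` are our own notation, not taken from the text.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma**2 * r_2 + ... for a reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards (Horner's scheme).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

For an episode with a single reward, as in the two-state example, the return equals the immediate reward for any discount factor.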

It is known that in partially observed environments, like in our case when
the internal states of the agents may not be observed, the direct policy search
method can be more efficient (jl). In this case, the policy of the agent is
explicitly represented in a parameterized form, and the parameters are updated
so that the described policy becomes optimal from the point of view of
the return. Policy gradient methods maximize the expected return by using
gradient methods. The gradient of the return function can be calculated explicitly
if the return function is known (see Appendix A). However, general
methods also exist for cases when the reward function is not known explicitly
(Appendix B).

Here, we shall present numerical results for these methods. Note that in
our simplified problem the immediate reward and the long-term reward are
identical. More sophisticated situations show the same phenomena (3).

3 Computer experiments 

We have tested our theoretical analysis by conducting numerical experiments. 
We used policy gradient methods, and various methods where the agents 
modelled each other. We studied the following cases: 
Method 1: The agents did not model each other. In this case we studied
value-based methods and the explicit policy gradient method. We present
results for the explicit policy gradient method.

Method 2: The agents did not model each other directly, but used the general
policy gradient method. This method models the world and thus the
other agent implicitly.

Method 3: Agent A estimated agent B's dynamics, i.e., the parameters that
determine B's policy. In this case, agent A used a 1-step model of agent
B. Thus, in this model, agent A senses the rewards of agent B and
chooses the optimal policy accordingly. Agent B did not model agent
A and applied the policy gradient method.

Method 4: Both agent A and agent B had access to the rewards of the other
agent and estimated each other's dynamics. Both agents used a 1-step
model of each other to choose their optimal policy.

Method 5: Both agent A and B had access to the rewards of the other agent
and estimated each other's dynamics. Agent A used a 2-step model of
B, agent B used a 1-step model of A to choose an optimal policy.

Method 6: Both agent A and B had access to the rewards of the other agent
and estimated each other's dynamics, and used a 2-step model of each
other to choose their optimal policy.

In the experiments, the values $\alpha$ and $\beta$ were initialized to 0.75 in all cases.
This choice enables us to compare the different methods. The values are
high enough to give the agents a fair chance at the beginning
to utilize communication. The values $p_1$, $p_2$, $q_X$, $q_Y$ were initialized randomly
according to the uniform distribution in the range [0.4, 0.6].

In the computational studies we averaged 1000 runs. In each run we
had at most 1000 learning episodes. In each episode an action was made
by agent A and a reaction, i.e., a guess, was made by agent B. Learning
was considered successful if, after a certain number of steps, trials were 100%
successful, i.e., the reward in each of the next 100 trials was +1. The number of
steps needed for successful communication (not including the 100 successful
ones that are used for measuring success rate) is the time needed for the
agreement. Figure 2 depicts the success rate for the different methods.

Fig. 2: Performance of the various methods as a function of the cost of
communication. Shorthand "vs": versus. For example, 2-step vs 1-step: one
agent used a 1-step model, the other agent used a 2-step model.
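The success criterion described above can be expressed as a short helper. This is a sketch of our reading of the criterion (a run of 100 consecutive successful trials, with the time to agreement counted up to the start of that run); the function name and the encoding of a successful trial as the value `1` are assumptions.

```python
def time_to_agreement(outcomes, window=100):
    """Return the number of episodes before the first run of `window`
    consecutive successful trials, or None if no such run occurs.

    `outcomes` holds +1 for a correct guess and -1 for a wrong one.
    """
    run = 0
    for index, outcome in enumerate(outcomes):
        run = run + 1 if outcome == 1 else 0
        if run == window:
            # Exclude the `window` successful episodes themselves.
            return index + 1 - window
    return None
```

A run that never produces 100 consecutive successes within the episode limit counts as unsuccessful and is excluded from the learning-time statistics.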



The general policy gradient method (Appendix B) is superior to the explicit
policy gradient method (Appendix A). However, if both agents use
these methods, then they will not learn to communicate if the communication
cost is high. Value estimation based reinforcement learning methods seem to
be the weakest amongst all methods that we studied (results are not shown
here). Methods where agents use 1-step or 2-step models are sometimes 100%
successful, with a single notable exception: if both agents use 2-step models,
then the success rate is only about 50%. When the rewards of the other agent
are available, then the value estimation based method (the SARSA method (0))
succeeds, too.

It can be seen that when agents do not model each other, the chance that
they learn to communicate decreases as the cost of communication increases.
However, when agents model each other, they are able to learn that communication
is useful even when the cost is high, with the peculiar exception
when both agents use 2-step models.

Figure 3 depicts the time needed to reach an agreement. Situations when
agreement was not reached are excluded from these statistics. It can also be
seen that when both agents can model the rewards of the other agent, then
agreement about the signal-meaning association is fast. This is so because
they shortcut the slow tuning procedure of reinforcement learning. If this
shortcut is not applied, as in the case of the value estimation based SARSA
method, agreement can still be reached, but only very slowly. When one of
the agents thinks two steps ahead, agreement is even faster. In this case,
agreement is accomplished in 1 step after an initial transient of 10 steps
when the agents estimate each other's parameters. When both agents try to
think two steps ahead and agreement is only achieved in 50% of the cases,
agreement - if it occurs - is very fast. Thus, if agreement is not reached
quickly, then agents could suspect that the second-order intentional model
(e.g., one agent assumes that the other agent uses a 1-step model) is not
valid.

Fig. 3: Learning time for various methods as a function of the cost of
communication. Averages include only successful learning cases. Shorthand gen:
general, pol: policy, grad: gradient



4 Discussion 

The theory of reinforcement learning shows that globally optimal solutions can
be learned 'easily' under strict conditions. The relevant condition for us is
the Markov condition: information from the past does not help in improving
decisions. In other words, all information is encoded in the actual state
of the agent and all state variables are available to the agent for acting and
learning. If this condition, together with some other technical assumptions,
is fulfilled, then the learning task is called a Markov decision problem (MDP,
see, e.g., (I 111 ) and the references therein).

The Markov condition is hardly met in real life. It is not met in our case 
either, because the parameters of decision making of agent A (or B) (i) are 
subject to experiences of agent A (or B), i.e., they depend on the history, (ii) 
these parameters are not available for agent B (or A), and (iii) agent B (or 
A) would benefit from knowing these parameters. In this case the world is
only partially observed and the task is called a partially observed Markov decision
problem (POMDP) (see, e.g., (0) and references therein).

This lack of information can be eased by modelling the other agent. The 
other agent might have many variables and a large subset of those variables 
can be modelled by different means. We demonstrated this by using policy 
gradient methods. Both the explicit policy gradient and the general pol- 
icy gradient method develop models of the 'private' parameters of the other 
agent: they model the state-action mapping, that is, the policy of the other 
agent. The modelling process can be explicit: a particular model is assumed 
in this case, or implicit, when there is a general parametrization in the policy 
gradient. Our simulations demonstrate that the performance of the explicit
policy gradient model is inferior to that of the general model. This observation
can be traced back to the differences between the methods: the general
policy gradient makes direct use of the immediate rewards and deals with individual
state-action sequences separately. Thus, the general policy gradient
method - up to some extent and indirectly - takes into account the intentions
of the other agent. In the case of the model-based explicit policy gradient
method this connection is highly remote: the same information enters the
computation only after expected value computation. Value estimation based
methods (not shown here) have the same drawback and they are also inferior
to the general policy gradient method. These notes concern our simple
scenario that does not fulfill the conditions of MDPs.

We have shown that the lack of a single quantity, the reward, makes a
huge difference: without access to the reward of the other agent, the emergence
of communication can be seriously limited if communication involves a
cost. The assumption that communication is costly seems realistic, because
communication takes time. Without access to the rewards of the other agent,
the higher the cost, the sooner the agents learn that communication is useless.

There are several exceptions to this simple observation. For example, if 
the policy of one of the agents is steady (i.e., this agent is not learning), 
then this agent will act effectively as the teacher and the adaptive agent can 
learn either the appropriate signal (if he is the speaker) or the appropriate 
meaning (if he is the listener). 

The problem arises if the learning rates of the two agents are about the
same. Then, to develop successful communication, they should be able to
sense and then model (implicitly or explicitly) the immediate rewards or the
cumulated rewards of the other agent. We shall call this capability emotional
intelligence. It is sufficient if one of the agents has that capability. If
an agent has emotional intelligence, then the learning of symbol-meaning
association may become very efficient.

There are many ways to make this learning efficient, depending on what
the agents assume about their partner. Consider, for example, that both
agents have emotional intelligence and both agents use this emotional intelligence
when they learn to communicate. Now, it makes a huge difference how
they use the emotional information they have. For example, indirect modelling
of the situation occurs if we assume that the agents receive the same
reward. Then we are in the MDP domain and we can safely apply MDP methods
such as SARSA (p) without directly modelling the other agent.

A large improvement was gained if both agents considered what is
best for them. Further, if (only) one of the agents used that information
to 'anticipate' what the other agent might prefer to do in the next step, in
reaction to his action, then learning became even faster, as was expected
from theory (Section 2).

However, learning is severely spoiled if both agents are clever enough
to anticipate the next step of the other agent. This has the following
explanation: both agents suppose that the other is using a 1-step model
to model him, which, in this case, is false, because both agents use 2-step
models. In this situation, in 50% of the cases the randomly generated initial
parameters allow an agreement to be reached just by chance. In the other 50% no
agreement is reached.

As we have noted earlier, in this peculiar case the agents could suspect 
that the 1-step model they use about the other agent is false: the other 
agent also considers 'what is on his partner's mind'. Such considerations
are the starting points of game theory. However, the situation here can be
different from game theory. In principle, our agents can expect very fast 
agreement and they can become frustrated because of the lack of this quick 
agreement. Our agents are also emotionally intelligent and they might sense 
the frustration of the other agent. That is, our agents might note that their 
models are not valid and might come to a joint agreement. Thus, in our case, 
agents may use higher-order intentional models and they will succeed. 

An advantage of our formulation is that the agent might decide whether he
wants to optimize the sum of the two rewards (cooperative agent), his own
reward (selfish agent), or the reward of the other agent no matter how much it
costs (altruistic agent); he might also decide to change this choice, and so on. These
situations call for further investigations.

In our simple example, the immediate reward and the long-term reward
were identical. Situations where these two quantities are different have also
been studied (1). The observations are about the same as in the simple case
that we presented here.

5 Conclusions 

We have used explicit and implicit models in reinforcement learning. The
world was partially observed, but otherwise it was simplified as much as possible:
we used two agents, two actions and two signals. We have shown that
emotional intelligence is necessary for the emergence of communication even
in this simplest possible case. Numerical simulations demonstrate that if the
rewards of the other agent are available for modelling, then signal-meaning
associations can be learned quickly. The order of intentionality that agents suppose
in their models about the other agent may give rise to problems, but
the mere fact of the disagreement indicates that the models could be invalid.
Novel situations may arise: agents might decide about their attitude towards
other agents.
6 Appendices: Algorithms and pseudo codes

A Explicit policy gradient method

In this case the explicit reward functions are available for the two agents and
they can calculate the gradients with respect to the parameter sets $M_A = (\alpha, p_1, p_2)$ and
$M_B = (\beta, q_X, q_Y)$:

$$R_A(M_A, M_B) = \alpha \cdot (-c_A) + 2\alpha\beta (p_1 - p_2)(q_X - q_Y)$$

for agent A and

$$R_B(M_A, M_B) = \beta \cdot (-c_B) + 2\alpha\beta (p_1 - p_2)(q_X - q_Y)$$

for agent B. As can be seen from the equations, each agent also needs to
estimate the parameters of the other agent in order to calculate its own
expected reward.

B General policy gradient method

Let our policy $\pi$ depend on the parameters summarized in a vector $\theta \in \mathbb{R}^k$.
Let $X$ be the set of all possible trajectories in the task, and let $r(X)$ denote
the reward collected in an episode. Then $\eta(\theta)$, the value of the policy $\pi(\theta)$,
is the expected value of the reward:

$$\eta(\theta) = E[r(X)] = \sum_{x \in X} r(x)\, q(\theta, x)$$

where $E[\cdot]$ denotes the expectation operator, $x \in X$ denotes a trajectory,
$r(x)$ denotes the reward collected while traversing trajectory $x$ and $q(\theta, x)$ is
the probability of traversing trajectory $x$ having parameters $\theta$. The gradient
of $\eta(\theta)$ with respect to $\theta$ is:

$$\nabla\eta(\theta) = \sum_{x} r(x)\, \nabla q(\theta, x) = \sum_{x} r(x)\, \frac{\nabla q(\theta, x)}{q(\theta, x)}\, q(\theta, x) = E\left[ r(X)\, \frac{\nabla q(\theta, X)}{q(\theta, X)} \right]$$

A sequence of trajectories $x^1, x^2, \ldots, x^n$ gives an unbiased estimate of $\nabla\eta(\theta)$:

$$\hat{\nabla}\eta(\theta) = \frac{1}{n} \sum_{i=1}^{n} r(x^i)\, \frac{\nabla q(\theta, x^i)}{q(\theta, x^i)}$$

Because of the law of large numbers, $\hat{\nabla}\eta(\theta) \to \nabla\eta(\theta)$ with probability 1.
The quantity $\frac{\nabla q(\theta, X)}{q(\theta, X)}$ is called the likelihood ratio or score function.

Let the trajectory $x$ be a sequence of states $x_1, x_2, \ldots, x_T$, and let
$p_{x_t x_{t+1}}(\theta)$ be the probability of moving from state $x_t$ to $x_{t+1}$ having
parameters $\theta$. Then:

$$\frac{\nabla q(\theta, x)}{q(\theta, x)} = \sum_{t=0}^{T-1} \frac{\nabla p_{x_t x_{t+1}}(\theta)}{p_{x_t x_{t+1}}(\theta)}$$

which can be derived the following way:

$$q(\theta, x) = \prod_{t=0}^{T-1} p_{x_t x_{t+1}}$$

$$\log q(\theta, x) = \log \prod_{t=0}^{T-1} p_{x_t x_{t+1}} = \sum_{t=0}^{T-1} \log p_{x_t x_{t+1}}$$

$$\nabla \log q(\theta, x) = \sum_{t=0}^{T-1} \nabla \log p_{x_t x_{t+1}} = \sum_{t=0}^{T-1} \frac{\nabla p_{x_t x_{t+1}}}{p_{x_t x_{t+1}}}$$

since $\nabla \log f(x) = \frac{\nabla f(x)}{f(x)}$. This sum can also be accumulated iteratively.
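The likelihood ratio identity can be checked on a toy problem by exact enumeration. The sketch below is not the communication task itself: it assumes a hypothetical one-step policy with a single scalar parameter, where trajectory 1 (reward +1) is chosen with a sigmoid probability and trajectory 0 (reward -1) otherwise. All names are illustrative.

```python
import math

REWARD = {1: 1.0, 0: -1.0}  # reward of each one-step trajectory


def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))


def trajectory_prob(theta, x):
    """q(theta, x): probability of trajectory x under parameter theta."""
    s = sigmoid(theta)
    return s if x == 1 else 1.0 - s


def trajectory_prob_grad(theta, x):
    """Derivative of q(theta, x) with respect to theta."""
    s = sigmoid(theta)
    ds = s * (1.0 - s)
    return ds if x == 1 else -ds


def value(theta):
    """eta(theta) = sum_x r(x) q(theta, x)."""
    return sum(REWARD[x] * trajectory_prob(theta, x) for x in (0, 1))


def likelihood_ratio_gradient(theta):
    """E[r(X) * grad q / q], computed exactly over both trajectories."""
    return sum(
        trajectory_prob(theta, x)
        * REWARD[x]
        * trajectory_prob_grad(theta, x)
        / trajectory_prob(theta, x)
        for x in (0, 1)
    )
```

Comparing `likelihood_ratio_gradient` with a finite-difference derivative of `value` gives the same number up to numerical error, which is what the derivation above predicts.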

In our case the algorithm is simplified, since each episode consists of
one step (agent A says something and agent B replies). Furthermore, we
update the parameters after each episode, which means $N = 1$ in the above
algorithm. This way the two cycles boil down to one line of update after
each episode:

$$\theta \leftarrow \theta + \epsilon \, r \, \frac{\nabla p_{s,a}(\theta)}{p_{s,a}(\theta)}$$

The respective gradients and probabilities can be calculated from the parameters
$\alpha, p_1, p_2, \beta, q_X, q_Y$.

In the tables the state or action denoted by * means communicating
nothing.
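For agent A, the one-step update can be sketched as follows, using the gradients and probabilities of the actions as tabulated in Tab. 3. The function names, the step size of 0.05 and the clipping of the parameters to [0, 1] are assumptions of this sketch rather than details confirmed by the text.

```python
def score_a(state, action, alpha, p1, p2):
    """Gradient of the action probability with respect to (alpha, p1, p2),
    divided by that probability. `action` is "X", "Y" or None (silent).

    Only call this for actions that have non-zero probability.
    """
    if action is None:
        prob = 1.0 - alpha
        grad = (-1.0, 0.0, 0.0)
    else:
        p = p1 if state == 1 else p2
        p_signal = p if action == "X" else 1.0 - p
        d_p = alpha if action == "X" else -alpha
        prob = alpha * p_signal
        grad = (p_signal, d_p, 0.0) if state == 1 else (p_signal, 0.0, d_p)
    return tuple(g / prob for g in grad)


def clip01(value):
    return min(1.0, max(0.0, value))


def update_a(params, state, action, reward, step=0.05):
    """One likelihood-ratio update of (alpha, p1, p2) after a single episode."""
    score = score_a(state, action, *params)
    return tuple(clip01(v + step * reward * s) for v, s in zip(params, score))
```

A positive reward increases the probability of the action just taken, and a negative reward decreases it. The update for agent B is analogous, with the entries of Tab. 4.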



C 1-step modelling 

In this case the agent calculates a conditional strategy that optimizes $M_A$
and $M_B$ jointly, as discussed in the text. Recall that by joint optimization
we mean that we can calculate the conditional strategy

$$M_{A|B}(M_B) = \arg\max_{M_A} R_A(M_A, M_B),$$

that is, A can calculate what his optimal choice would be if B followed $M_B$.
A can estimate the parameters of B and thus can estimate his
policy, $M_B = (\beta, q_X, q_Y)$. The same is true vice versa, for agent B estimating
the policy of A, $M_A = (\alpha, p_1, p_2)$. The parameters can be estimated by the
agents observing each other's behavior, and approximating the parameters
with their relative frequencies, that is, the ratio of the occurrence frequencies
of certain actions.

Then $M_{A|B}(M_B)$ can be derived analytically, and is the following:

• if B's will to use communication ($\beta$) is so low that it is not worth using
communication for A because of his own cost, then do not communicate
anything,

• otherwise, if A is in state 1, and B is more likely to answer 1 to X than
to Y ($q_X > q_Y$), or if A is in state 2 and B is more likely to answer 2
to X than to Y ($q_X < q_Y$), then say X,

• otherwise say Y.

The conditional policy of agent B, $M_{B|A}(M_A)$, is essentially the same, but
using the estimated parameters of A $(\alpha, p_1, p_2)$.
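The rule for agent A listed above can be written as a short function. The text does not state the exact threshold for "not worth using communication"; the sketch assumes it follows from the formula for $R_A$ in Appendix A, i.e. A communicates only if $2\beta |q_X - q_Y| > c_A$. The function name is hypothetical.

```python
def best_response_a(state, beta, q_x, q_y, c_a):
    """1-step model: A's best action given (estimated) parameters of B.

    Returns "X", "Y", or None for "do not communicate".
    """
    # Assumed threshold: best achievable gain per communication vs. A's cost.
    gain = 2.0 * beta * abs(q_x - q_y)
    if gain <= c_a:
        return None
    if state == 1:
        # B is more likely to answer 1 to X than to Y.
        return "X" if q_x > q_y else "Y"
    # State 2: B is more likely to answer 2 to X than to Y.
    return "X" if q_x < q_y else "Y"
```

For example, with $\beta = 0.75$, $q_X = 0.9$, $q_Y = 0.1$ and $c_A = 0.5$, agent A says X in state 1 and Y in state 2, while for $\beta = 0.1$ it stays silent.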

D 2-step modelling 

Supposing that B uses 1-step modelling, A can think one step further. Based
on that, he can simply choose his optimal strategy:

$$M_A^* = \arg\max_{M_A} R_A(M_A, M_{B|A}(M_A)).$$

This optimal policy can also be derived analytically, and is the following:

• if B's will to use communication ($\beta$) is so low that it is not worth
using communication for A because of his own cost, or A's will to use
communication ($\alpha$) is so low that it is not worth using communication
for B because of his own cost, then do not communicate anything,

• otherwise, if A is in state 1 and $p_1 > p_2$ (or if A is in state 2 and
$p_1 < p_2$), then suppose that B traces this, and answers 1 (2) if A says
X, so say X,

• otherwise say Y.

Again, the optimal policy for agent B is essentially the same, using the
other's parameters.
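The following sketch illustrates why two agents that both use this 2-step rule agree only about half of the time. It assumes that communication is worthwhile for both agents (the cost conditions are skipped) and that B's rule is the mirror image of A's rule, which is our reading of "essentially the same". All names are illustrative.

```python
import random


def a_says(state, p1, p2):
    """2-step rule for A: rely on his own tendency and expect B to follow."""
    if (state == 1 and p1 > p2) or (state == 2 and p1 < p2):
        return "X"
    return "Y"


def b_guesses(signal, q_x, q_y):
    """Assumed mirror-image 2-step rule for B, based on his own tendency."""
    if (signal == "X" and q_x > q_y) or (signal == "Y" and q_x < q_y):
        return 1
    return 2


def agree(p1, p2, q_x, q_y):
    """True if B's guess matches A's state in both states."""
    return all(b_guesses(a_says(s, p1, p2), q_x, q_y) == s for s in (1, 2))


def agreement_rate(trials, rng):
    """Fraction of random initializations in [0.4, 0.6] that lead to agreement."""
    hits = 0
    for _ in range(trials):
        p1, p2, q_x, q_y = (rng.uniform(0.4, 0.6) for _ in range(4))
        hits += agree(p1, p2, q_x, q_y)
    return hits / trials
```

Agreement occurs exactly when $p_1 - p_2$ and $q_X - q_Y$ have the same sign, which happens with probability one half for independent random initial values.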

E SARSA 

The SARSA algorithm builds a table and computes the value of each entry.
For the description of the algorithm, see, e.g., (@; Q) and references therein. 
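For orientation, the standard tabular SARSA update can be sketched as below. This is the textbook update rule, not the specific implementation used in the experiments; the learning rate `lr`, the discount `gamma` and the dictionary-based table are assumptions.

```python
from collections import defaultdict


def sarsa_update(q, state, action, reward, next_state, next_action,
                 lr=0.1, gamma=1.0):
    """Apply one SARSA update to the table `q` and return the new entry.

    `next_state` is None when the episode ends after this step.
    """
    target = reward
    if next_state is not None:
        target += gamma * q[(next_state, next_action)]
    q[(state, action)] += lr * (target - q[(state, action)])
    return q[(state, action)]
```

With one-step episodes, as in the two-state example, `next_state` is always `None` and each table entry simply tracks a running average of the rewards received for that state-action pair.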
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e = 0.05, r,p e [0,1] 
for each test 

a, (3 = 0.75, initialize pi,P2, Qx, % to random values 
for each episode i = 1, MAX_EPISODES do 
Agent A 

update the approximation of the parameters of B: (3, q x , q y 
update own parameters by gradient: 

Aa = -ca + /3{r + p) + ${pi - p 2 ){qx -Qy){r-p) 

Api = a(3(q x - qy){r - p) 

Ap 2 = -ap(q x -qy)(r-p) 

a a + eAa 

Pi <— Pi + eApi 

P2 <- P2 + ^Ap 2 
Agent B 

update the approximation of the parameters of A: a,pi,p 2 
update own parameters by gradient: 

Abeta = —c B + 6t(r + p) + a(p 1 - p 2 )(qx - Qy)(t - p) 

Aq x = aPfa -P2){r-p) 

Aq y = -a(3{pi - p 2 )(r - p) 

P<- (3 + eA(3 

q x q x + 

q y <— q y + eAq y 

end for 
end for 



Tab. 1: Pseudo-code of the explicit policy gradient method 
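
The loop of Tab. 1 can be written as a short Python sketch. Several things here are assumptions rather than part of the pseudo-code: each agent uses the other's exact parameters instead of frequency estimates, parameters are clipped to $[0, 1]$ after every step, and the values chosen for `r`, `p`, `cost_a`, and `cost_b` are arbitrary.

```python
import random


def clip01(v: float) -> float:
    """Keep a probability-like parameter inside [0, 1] (added assumption)."""
    return min(1.0, max(0.0, v))


def run_explicit_policy_gradient(r=1.0, p=0.0, cost_a=0.1, cost_b=0.1,
                                 eps=0.05, max_episodes=1000, seed=0):
    """One 'test' of Tab. 1; returns the final parameters of both agents."""
    rng = random.Random(seed)
    alpha = beta = 0.75
    p1, p2, qx, qy = (rng.random() for _ in range(4))
    for _ in range(max_episodes):
        # Agent A: gradient step on (alpha, p1, p2) given B's parameters.
        d_alpha = -cost_a + beta * (r + p) + beta * (p1 - p2) * (qx - qy) * (r - p)
        d_p1 = alpha * beta * (qx - qy) * (r - p)
        d_p2 = -alpha * beta * (qx - qy) * (r - p)
        alpha = clip01(alpha + eps * d_alpha)
        p1 = clip01(p1 + eps * d_p1)
        p2 = clip01(p2 + eps * d_p2)
        # Agent B: gradient step on (beta, qx, qy) given A's parameters.
        d_beta = -cost_b + alpha * (r + p) + alpha * (p1 - p2) * (qx - qy) * (r - p)
        d_qx = alpha * beta * (p1 - p2) * (r - p)
        d_qy = -alpha * beta * (p1 - p2) * (r - p)
        beta = clip01(beta + eps * d_beta)
        qx = clip01(qx + eps * d_qx)
        qy = clip01(qy + eps * d_qy)
    return {"alpha": alpha, "beta": beta, "p1": p1, "p2": p2, "qx": qx, "qy": qy}
```

The product $(p_1 - p_2)(q_X - q_Y)$ is a convenient summary of the outcome: it approaches 1 when the agents have settled on a common symbol-meaning association, whichever of the two possible conventions they picked.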
$z = 0 \in \mathbb{R}^k$, $\Delta = 0 \in \mathbb{R}^k$
for each episode $j = 1, \ldots, N$ do
    $R = 0 \in \mathbb{R}$
    for each state transition $x_t \to x_{t+1}$ do
        $z_{t+1} = z_t + \nabla p_{x_t x_{t+1}}(\theta) / p_{x_t x_{t+1}}(\theta)$
        $R_{t+1} = R_t + \frac{1}{t+1}(r_t - R_t)$
    end for
    $\Delta \leftarrow \ldots$
end for

Tab. 2: Pseudo-code for the general policy gradient method

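Tab. 3: Gradients and probabilities for agent A

state   action   $\nabla\alpha$   $\nabla p_1$   $\nabla p_2$   probability
1       X        $p_1$            $\alpha$                      $\alpha p_1$
1       Y        $1 - p_1$        $-\alpha$                     $\alpha(1 - p_1)$
2       X        $p_2$                           $\alpha$       $\alpha p_2$
2       Y        $1 - p_2$                       $-\alpha$      $\alpha(1 - p_2)$
*                $-1$                                           $(1 - \alpha)$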
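
The gradient columns of Tab. 3 are the partial derivatives of the probability column, which can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, and blank table cells are assumed to mean a zero derivative.

```python
def action_probabilities(alpha, p1, p2):
    """Probability column of Tab. 3, keyed by (state, action)."""
    return {
        (1, "X"): alpha * p1,
        (1, "Y"): alpha * (1 - p1),
        (2, "X"): alpha * p2,
        (2, "Y"): alpha * (1 - p2),
        ("*", None): 1 - alpha,  # A does not communicate
    }


def table_gradients(alpha, p1, p2):
    """Gradient columns of Tab. 3 as (d/d_alpha, d/d_p1, d/d_p2)."""
    return {
        (1, "X"): (p1, alpha, 0.0),
        (1, "Y"): (1 - p1, -alpha, 0.0),
        (2, "X"): (p2, 0.0, alpha),
        (2, "Y"): (1 - p2, 0.0, -alpha),
        ("*", None): (-1.0, 0.0, 0.0),
    }


def max_gradient_error(alpha, p1, p2, h=1e-6):
    """Largest gap between table entries and central finite differences."""
    theta = [alpha, p1, p2]
    expected = table_gradients(alpha, p1, p2)
    worst = 0.0
    for i in range(3):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        prob_up = action_probabilities(*up)
        prob_down = action_probabilities(*down)
        for key, grads in expected.items():
            numeric = (prob_up[key] - prob_down[key]) / (2 * h)
            worst = max(worst, abs(numeric - grads[i]))
    return worst
```

Calling `max_gradient_error(0.75, 0.6, 0.3)` should return a value at the level of floating-point rounding, since every probability is linear in each parameter.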



Tab. 4: Gradients and probabilities for agent B 



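state   action   $\nabla\beta$   $\nabla q_X$   $\nabla q_Y$   probability
X       1        $q_X$           $\beta$                       $\beta q_X$
X       2        $1 - q_X$       $-\beta$                      $\beta(1 - q_X)$
Y       1        $q_Y$                          $\beta$        $\beta q_Y$
Y       2        $1 - q_Y$                      $-\beta$       $\beta(1 - q_Y)$
*                $-1$                                          $(1 - \beta)$
        1/2      $0.5$                                         $\beta$
                 $-0.5$                                        $(1 - \beta)$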
$$\alpha \approx \frac{\#\,\text{episodes where A had chosen to communicate}}{\#\,\text{all episodes so far}}$$

$$p_1 \approx \frac{\#\,\text{episodes where A said "X" in state "1"}}{\#\,\text{all episodes so far where A had chosen to communicate}}$$

$$p_2 \approx \frac{\#\,\text{episodes where A said "X" in state "2"}}{\#\,\text{all episodes so far where A had chosen to communicate}}$$

$$\beta \approx \frac{\#\,\text{episodes where B had chosen to listen}}{\#\,\text{all episodes so far}}$$

$$q_X \approx \frac{\#\,\text{episodes where B guessed "1" after hearing "X"}}{\#\,\text{all episodes so far where B had chosen to listen}}$$

$$q_Y \approx \frac{\#\,\text{episodes where B guessed "1" after hearing "Y"}}{\#\,\text{all episodes so far where B had chosen to listen}}$$



Tab. 5: Estimating parameters 
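
As an illustration, the estimates of A's parameters in Tab. 5 can be maintained with simple counters. The class and method names below are hypothetical, and the sketch assumes that the observer learns A's state and signal after each episode; the denominators follow Tab. 5 as printed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SpeakerStats:
    """Running counts from which B estimates A's (alpha, p1, p2)."""
    episodes: int = 0
    communicated: int = 0
    said_x_in_1: int = 0
    said_x_in_2: int = 0

    def observe(self, state: int, signal: Optional[str]) -> None:
        """Record one episode; signal is "X", "Y", or None if A was silent."""
        self.episodes += 1
        if signal is None:
            return
        self.communicated += 1
        if signal == "X" and state == 1:
            self.said_x_in_1 += 1
        elif signal == "X" and state == 2:
            self.said_x_in_2 += 1

    def estimates(self) -> Tuple[float, float, float]:
        """Relative frequencies as in Tab. 5 (0.0 while a count is empty)."""
        alpha = self.communicated / self.episodes if self.episodes else 0.0
        p1 = self.said_x_in_1 / self.communicated if self.communicated else 0.0
        p2 = self.said_x_in_2 / self.communicated if self.communicated else 0.0
        return alpha, p1, p2
```

The counters for B's parameters ($\beta$, $q_X$, $q_Y$) would follow the same pattern, with "listened" in place of "communicated" and B's guesses in place of A's signals.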



if $2\beta < c_A$
    do not communicate
otherwise
    if (A is in state 1 and $q_X > q_Y$) or (A is in state 2 and $q_X < q_Y$)
        say X
    otherwise
        say Y
    end if
end if



Tab. 6: Pseudo-code of the 1-step modelling method for agent A 



if $2\beta < c_A$ or $2\alpha < c_B$
    do not communicate
otherwise
    if (A is in state 1 and $p_1 > p_2$) or (A is in state 2 and $p_1 < p_2$)
        say X
    otherwise
        say Y
    end if
end if



Tab. 7: Pseudo-code of the 2-step modelling method for agent A 






References 

[1] Special Issue on the Emergence of Language, Connection Science 17
(2005), no. 3-4, 185-397, Editor: A. Cangelosi.

[2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming,
Athena Scientific, Belmont, MA, 1996.

[3] R. Dunbar, Grooming, gossip, and the evolution of language, Harvard
University Press, Cambridge, MA, 1997.

[4] V. Gyenes, A. Bontovics, and M. Kiszlinger, Experimenting with emotional
intelligence and echo-state networks, Technical Report
ELU-November-2006-1, Eötvös Loránd University, Budapest, H, 2006,
unpublished.

[5] M. Hauskrecht, Value-function approximations for partially observable
Markov decision processes, Journal of Artificial Intelligence Research 13
(2000), 33-94.

[6] G. Rummery, Problem solving with reinforcement learning, Ph.D. thesis,
University of Cambridge, Cambridge, UK, 1995.

[7] S. Singh, T. Jaakkola, M. L. Littman, and Cs. Szepesvári, Convergence
results for single-step on-policy reinforcement-learning algorithms,
Machine Learning 38 (2000), 287-303.

[8] L. Steels, Evolving grounded communication for robots, Trends in
Cognitive Sciences 7 (2003), 308-312.

[9] L. Steels and P. Vogt, Grounding adaptive language games in
robotic agents, Proc. of European Conf. on Artificial Life
(Cambridge, MA) (P. Harvey and P. Husbands, eds.), MIT Press, 1997,
http://www.ling.ed.ac.uk/~paulv/ecal97.pdf

[10] E. Stenius, Mood and language game, Synthese 17 (1967), 254-274.

[11] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction,
MIT Press, Cambridge, MA, 1998.

[12] Sz. Számadó and E. Szathmáry, Selective scenarios for the emergence of
natural language, Trends in Ecology and Evolution (2006), (in press).






[13] L. Wittgenstein, Philosophical investigations, Basil Blackwell, Oxford, 
UK, 1974, Translated by G. Anscombe. 



